This version of practical manual includes output of the scritps and answers.

In this part of the course we will look into basic steps of the pathway enrichment analysis. There are many different methods, implementations and tools for performing this.

Some up-to-date web-based enrichment tools are, for example:

Here we will focus on ORA (overrepresentation analysis) and GSEA (Gene Set Enrichment Analysis), as implemented in ClusterProfiler R package [@Yu2012].

1 Preparation

1.4 Set seed for the R session

Some of the commands we use in this practical use random processes, such as random permutations. This means that each time the script is ran, the outcome may differ very slightly, just by chance. To make your code reproducible it is good idea to define the number “seed” in the beginning of the script. This way the script gives exactly the same output every time it is ran.

2 Prepare the data

2.1 Read in the data from RNA-seq differential analysis

We will read in the results of differential expression analysis from previous practical. We will concentrate on anti-CD3-stimulated dataset in this practical.

2.2 Prepare data for ORA and GSEA

Next we will do some data cleaning and convert gene IDs into usable format.

##       V1               baseMean         log2FoldChange       lfcSE      
##  Length:23228       Min.   :      0.0   Min.   :-5.661   Min.   :0.000  
##  Class :character   1st Qu.:      0.4   1st Qu.:-0.315   1st Qu.:0.185  
##  Mode  :character   Median :     40.4   Median :-0.008   Median :0.311  
##                     Mean   :   1517.1   Mean   : 0.036   Mean   :0.640  
##                     3rd Qu.:    982.8   3rd Qu.: 0.330   3rd Qu.:0.618  
##                     Max.   :1388488.9   Max.   : 5.517   Max.   :3.030  
##                                         NA's   :3287     NA's   :3287   
##       stat            pvalue           padj      
##  Min.   :-8.122   Min.   :0.000   Min.   :0.000  
##  1st Qu.:-0.930   1st Qu.:0.054   1st Qu.:0.130  
##  Median :-0.017   Median :0.319   Median :0.448  
##  Mean   : 0.067   Mean   :0.395   Mean   :0.456  
##  3rd Qu.: 1.061   3rd Qu.:0.725   3rd Qu.:0.763  
##  Max.   : 8.909   Max.   :1.000   Max.   :1.000  
##  NA's   :3287     NA's   :3287    NA's   :6357
##         V1 baseMean log2FoldChange lfcSE stat pvalue padj
## 1:   A4GNT        0             NA    NA   NA     NA   NA
## 2:    AA06        0             NA    NA   NA     NA   NA
## 3:  AACSP1        0             NA    NA   NA     NA   NA
## 4:   AADAC        0             NA    NA   NA     NA   NA
## 5: AADACL3        0             NA    NA   NA     NA   NA
## 6: AADACL4        0             NA    NA   NA     NA   NA
## 
## TRUE 
## 3287

We will remove the genes with mean base expression level 0, as those were not expressed in our data.

As you may see from the R message, ~10% of HGNC symbols were not connected with any ENTREZ ID. This is due to the unstable nature of HGNC annotation. In real-life science project you should opt for more stable annotation scheme (e.g. use one ENSEMBL/UCSC annotation database version throughout project).

For the sake of brevity and simplicity, in this practical we remove unlinked gene IDs from following steps and proceed with remaining 90% of genes.

We will prepare the data for ORA. In ORA we will use genes which are significantly differentically expressed (FDR<0.05, absolute fold-change>2). Fold change is often used as additional filter for identifying most relevant genes for interpretation.

Next we will prepare data for GSEA analysis. For that we need to order all genes based on their correlation strenght with the phenotype or effect size. In our case we will order those based on \(log_2(FC)\). We will also extract the vector of all tested genes which can be used as “gene universe”.

3 Over-representation analyses (ORA)

3.1 Run KEGG over-representation analysis

In this part we run enrichment test (hypergeometric test) for all differentially expressed genes (FDR<0.05, FC>2).

  • We use all genes tested in the RNA-profiling study as a background set or “gene universe”. This list was already constructed in the previous step.

  • We will query for all enrichment results and write out 25 most differentially expressed, not accounting the significance.

  • The default multiple testing is done by Benjamini-Hochberg method, flag pvalueCutoff can be used for filtering only significant results (<0.05). Additionally, qvalueCutoff flag is used to filter results based on another popular method of FDR estimation (Storey q-value).

  • The defaults of the command also apply restrictions on the sizes of gene sets tested in the analysis, those can be seen with command ?enrichKEGG.

We will accustom ourselves with the data structure of the analysis output.

The resulting object is more complex R S4 data structure which consists separate “containers” for different results, analysis settings, etc. You can access to those containers by using @.

##                ID                            Description GeneRatio
## hsa04060 hsa04060 Cytokine-cytokine receptor interaction    67/517
## hsa04640 hsa04640             Hematopoietic cell lineage    28/517
## hsa05321 hsa05321       Inflammatory bowel disease (IBD)    20/517
## hsa05310 hsa05310                                 Asthma    12/517
## hsa04630 hsa04630             JAK-STAT signaling pathway    31/517
##           BgRatio       pvalue     p.adjust       qvalue
## hsa04060 269/6779 3.658291e-19 1.101145e-16 9.704097e-17
## hsa04640  93/6779 1.045336e-10 1.573231e-08 1.386446e-08
## hsa05321  63/6779 1.873165e-08 1.879409e-06 1.656272e-06
## hsa05310  27/6779 2.053756e-07 1.545451e-05 1.361964e-05
## hsa04630 154/6779 3.958082e-07 2.203426e-05 1.941820e-05
##                                                                                                                                                                                                                                                                                                                                                                                                    geneID
## hsa04060 BMP6/BMPR2/CCL1/CCL24/CCL3/CCL4/CCR1/CCR7/CCR9/CD27/CD70/CSF1/CSF1R/CSF2/CSF3/CX3CR1/CXCL10/CXCL13/EDAR/GDF10/IFNG/IL10/IL12RB2/IL13/IL13RA1/IL16/IL17A/IL17F/IL18RAP/IL19/IL1R1/IL1R2/IL1RL1/IL1RN/IL21/IL22/IL23R/IL26/IL27/IL2RA/IL3/IL31/IL32/IL36A/IL4/IL5/IL7/IL7R/IL9/IL9R/LEPR/LIF/LIFR/LTA/OSM/TGFB2/TGFBR2/TNF/TNFRSF10D/TNFRSF11B/TNFRSF13B/TNFRSF18/TNFRSF4/TNFRSF8/TNFSF9/XCL1/XCL2
## hsa04640                                                                                                                                                                                                                                         CD33/CD36/CD38/CD9/CR1/CR2/CSF1/CSF1R/CSF2/CSF3/FCER2/HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IL1R1/IL1R2/IL2RA/IL3/IL4/IL5/IL7/IL7R/IL9R/ITGA5/ITGAM/TFRC/TNF
## hsa05321                                                                                                                                                                                                                                                                           HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IFNG/IL10/IL12RB2/IL13/IL17A/IL17F/IL18RAP/IL21/IL22/IL23R/IL4/IL5/JUN/MAF/TGFB2/TNF
## hsa05310                                                                                                                                                                                                                                                                                                                             HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IL10/IL13/IL3/IL4/IL5/IL9/PRG2/TNF
## hsa04630                                                                                                                                                                                                                              BCL2L1/CCND2/CSF2/CSF3/IFNG/IL10/IL12RB2/IL13/IL13RA1/IL19/IL21/IL22/IL23R/IL2RA/IL3/IL4/IL5/IL7/IL7R/IL9/IL9R/JAK2/LEPR/LIF/LIFR/OSM/PDGFA/PDGFB/SOCS1/SOCS3/STAT2
##          Count
## hsa04060    67
## hsa04640    28
## hsa05321    20
## hsa05310    12
## hsa04630    31
## [1] "KEGG"
## [1] 1

Now we look at the results by printing out 25 first rows of the result table. View function in RStudio makes it very comfortable.

ID Description GeneRatio BgRatio pvalue p.adjust qvalue Count
hsa04060 hsa04060 Cytokine-cytokine receptor interaction 67/517 269/6779 0.0000000 0.0000000 0.0000000 67
hsa04640 hsa04640 Hematopoietic cell lineage 28/517 93/6779 0.0000000 0.0000000 0.0000000 28
hsa05321 hsa05321 Inflammatory bowel disease (IBD) 20/517 63/6779 0.0000000 0.0000019 0.0000017 20
hsa05310 hsa05310 Asthma 12/517 27/6779 0.0000002 0.0000155 0.0000136 12
hsa04630 hsa04630 JAK-STAT signaling pathway 31/517 154/6779 0.0000004 0.0000220 0.0000194 31
hsa04110 hsa04110 Cell cycle 27/517 124/6779 0.0000004 0.0000220 0.0000194 27
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor 21/517 89/6779 0.0000022 0.0000933 0.0000822 21
hsa05330 hsa05330 Allograft rejection 12/517 35/6779 0.0000056 0.0002103 0.0001854 12
hsa04380 hsa04380 Osteoclast differentiation 24/517 126/6779 0.0000219 0.0007322 0.0006452 24
hsa05140 hsa05140 Leishmaniasis 17/517 73/6779 0.0000243 0.0007322 0.0006452 17
hsa04658 hsa04658 Th1 and Th2 cell differentiation 19/517 90/6779 0.0000360 0.0009857 0.0008687 19
hsa04940 hsa04940 Type I diabetes mellitus 11/517 39/6779 0.0001068 0.0026799 0.0023617 11
hsa03460 hsa03460 Fanconi anemia pathway 12/517 47/6779 0.0001501 0.0034744 0.0030619 12
hsa03440 hsa03440 Homologous recombination 10/517 37/6779 0.0003229 0.0069430 0.0061187 10
hsa04672 hsa04672 Intestinal immune network for IgA production 11/517 45/6779 0.0004253 0.0085336 0.0075205 11
hsa05166 hsa05166 Human T-cell leukemia virus 1 infection 30/517 215/6779 0.0008299 0.0154437 0.0136101 30
hsa04659 hsa04659 Th17 cell differentiation 18/517 105/6779 0.0008745 0.0154437 0.0136101 18
hsa05320 hsa05320 Autoimmune thyroid disease 11/517 49/6779 0.0009235 0.0154437 0.0136101 11
hsa05332 hsa05332 Graft-versus-host disease 9/517 37/6779 0.0014636 0.0231868 0.0204339 9
hsa04657 hsa04657 IL-17 signaling pathway 15/517 86/6779 0.0019276 0.0276293 0.0243489 15
hsa05323 hsa05323 Rheumatoid arthritis 15/517 86/6779 0.0019276 0.0276293 0.0243489 15
hsa04151 hsa04151 PI3K-Akt signaling pathway 40/517 338/6779 0.0031992 0.0437704 0.0385736 40
hsa05169 hsa05169 Epstein-Barr virus infection 26/517 196/6779 0.0036937 0.0483396 0.0426003 26
hsa05418 hsa05418 Fluid shear stress and atherosclerosis 19/517 129/6779 0.0039472 0.0495045 0.0436270 19
hsa00240 hsa00240 Pyrimidine metabolism 10/517 51/6779 0.0045050 0.0542009 0.0477658 10

Q1: How many KEGG pathways are significant if we consider Benjamini-Hochberg P<0.05?

A: 12

Q2: How many KEGG pathways are significant if we consider Storey Q<0.05?

A: 12

Q3: Authors of the original paper [@Quinn2015, [link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0140049)] have already applied the KEGG overrepresentation test to the same data. Find the results from the publication and compare to your results- are those similar? If not exactly, speculate, what could be the reason(s) of observed differences?

A: Quite similar. Possible sources of difference:

-Possibly different ORA method.

-Possibly different database version.

-Possibly some difference in preprocessing of the data (e.g. normalization).

-Most probably different defaults of ORA (e.g. gene set size 10-500).

Q4: Use Fisher’s exact test to test the over-representation of the fifth most significant gene set enrichment result.

  • Google “Fisher’s Exact test R”

    • Construct the contigency table- you can use this code snippet to replace “NAs” with the correct numbers from the output table:
  • Tip: you can use command addmargins on your GeneMatrixFisher to see if your contigency table adds up.

    • Use the GeneMatrixFisher to run Fisher’s Exact Test.

Q4.1: Report the used command, Fisher’s exact test P-value and odds ratio.

A:

Pathway:

Cell cycle

Contingency table:

##                    Part_of_pathway Not_part_of_pathway
## Diff_expressed                  24                 100
## Not_diff_expressed             399                5757

Command:

fisher.test(GeneMatrixFisher, alternative = 'greater')

Output:

## 
##  Fisher's Exact Test for Count Data
## 
## data:  GeneMatrixFisher
## p-value = 1.9e-06
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  2.270164      Inf
## sample estimates:
## odds ratio 
##   3.461707

Q4.2: Does Fisher’s exact test give similar/same results as ClusterProfiler?

A: When considering rounding, yes. Hypergeometric test is actually identical with one-sided Fisher’s Exact Test.

3.1.1 Visualize the overall KEGG enrichment analysis results

We can visualize our results on a barplot. Length of the bar shows significance (\(-log_{10}(P)\)) and the color shows number of genes overlapping with gene set.

Next we visualize five most enriched pathways on the network graph. Size of the pathway node shows statistical significance and links indicate gene membership in the pathway. This gives the static graph as a output. If it is necessary to investigate further, you can use flag fixed = FALSE to see interactive version.

Q5: What are the most up- and down-regulated genes from 5 most enriched pathways, based on visual inspection?

A: Up-regulated: IFNG, down-regulated: KDR.

Q6: Is there any of the top pathways showing coordinated up- or down-regulation for the majority (or all) of its members?

A: Yes, “Cell cycle”, only one differentially expressed gene is down-regulated.

3.1.1.1 3.1.2 Visualize the most significantly enriched KEGG pathway

For KEGG pathways it is possible to visualize the location of differentially expressed genes on the pathway, as well as their magnitude of differential expression. This kind of visualization might help to identify most relevant genes for given condition and generate new research hypotheses.

For adding the plot to the HTML document: in the following command, replace the file name with the name of the file which was written to your work directory.

![]([YourPathName].pathview.png)

Q7: Same figure is also reported in the original publication [@Quinn2015, [link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0140049)]. Does it look similar or do you see any difference? Speculate, where can it come from?

A: Results are in quite good concordance with published results. There is some minor difference between two sets of results- this could be caused by the same reasons as explained in Q3.

3.2 Run Gene Ontology over-representation analysis

Next we run the overrepresentation test for Gene Ontologies.

  • For the sake of brevity, we test only the ontologies from Biological Process category. By setting the ont flag, is also possible to test Molecular Function (MF), Cellular Component (CC) or all three categories combined (ALL).

    • For visualization purposes we apply this time significance thresholds (Benjamini-Hochberg FDR<0.05 and Storey FDR<0.05).

    • We specify that we test the enrichment for the gene sets with sizes between 10-10000 genes (flags minGSSize and maxGSSize)

ID Description GeneRatio BgRatio pvalue p.adjust qvalue Count
GO:0002376 GO:0002376 immune system process 277/974 2561/14308 0 0 0 277
GO:0008283 GO:0008283 cell proliferation 203/974 1695/14308 0 0 0 203
GO:0034097 GO:0034097 response to cytokine 145/974 1077/14308 0 0 0 145
GO:0051716 GO:0051716 cellular response to stimulus 526/974 5959/14308 0 0 0 526
GO:0050896 GO:0050896 response to stimulus 612/974 7260/14308 0 0 0 612
GO:0071345 GO:0071345 cellular response to cytokine stimulus 135/974 1001/14308 0 0 0 135
GO:0070887 GO:0070887 cellular response to chemical stimulus 286/974 2798/14308 0 0 0 286
GO:0042127 GO:0042127 regulation of cell proliferation 170/974 1414/14308 0 0 0 170
GO:0000278 GO:0000278 mitotic cell cycle 118/974 851/14308 0 0 0 118
GO:0010033 GO:0010033 response to organic substance 288/974 2839/14308 0 0 0 288
GO:1903047 GO:1903047 mitotic cell cycle process 107/974 740/14308 0 0 0 107
GO:0071310 GO:0071310 cellular response to organic substance 247/974 2331/14308 0 0 0 247
GO:0009605 GO:0009605 response to external stimulus 217/974 1968/14308 0 0 0 217
GO:0007166 GO:0007166 cell surface receptor signaling pathway 258/974 2498/14308 0 0 0 258
GO:0006955 GO:0006955 immune response 192/974 1729/14308 0 0 0 192

3.2.1 Visualize the GO enrichment analysis results

To ease the interpretation and see the hierarchical relationships between most enriched GO terms, we visualize those on graph structure. Square boxes are significantly enriched terms and color depicts the significance (P-value). We will visualize 15 most significant terms.

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 26 
## Number of Edges = 39 
## 
## $complete.dag
## [1] "A graph with 26 nodes."

Q8: What is the most significant GO term?

A: Cell proliferation

Q9: What are the main general themes coming out of the analysis?

A: Look at the hierarchical structure indicates mainly immune-related processes and cell proliferation.

3.3 Run Disease Ontology over-representation analysis

Clusterprofiler allows to do also ORA for human diseases, making use of several databases of curated gene-disease relationships. As our trait of interest is Coeliac disease, it would be interesting to see differentially expressed genes also show enrichment of some human diseases and whether those are could be related with Coeliac disease or any other autoimmune disease.

Q10: If you see that the genes differentially expressed in your disease-of-interest (in our case Coeliac disease) are enriched by genes known to be associated with some other disease (e.g. inflammatory bowel disease, another autoimmune disease), what is the biological conclusion you make? Could this observation be useful for the search of the treatment and how would you proceed with that?

A: This could mean that there might be some similar mechanisms playing role in the pathogenesis of those two diseases. You can imagine that if there is known treatment for disease which comes significant, it might point you to leads for new potential drugs/treatments.

We will run DO enrichment analysis by asking all the results (no filtering based on FDR) and otherwise with default settings (minGSSize = 10, maxGSSize = 500).

ID Description GeneRatio BgRatio pvalue p.adjust qvalue Count
DOID:74 DOID:74 hematopoietic system disease 81/610 433/6810 0e+00 0.00e+00 0.0e+00 81
DOID:850 DOID:850 lung disease 82/610 456/6810 0e+00 1.00e-07 0.0e+00 82
DOID:0050161 DOID:0050161 lower respiratory tract disease 83/610 465/6810 0e+00 1.00e-07 0.0e+00 83
DOID:1176 DOID:1176 bronchial disease 36/610 137/6810 0e+00 3.00e-07 2.0e-07 36
DOID:2841 DOID:2841 asthma 32/610 119/6810 0e+00 1.20e-06 7.0e-07 32
DOID:3388 DOID:3388 periodontal disease 32/610 121/6810 0e+00 1.50e-06 9.0e-07 32
DOID:824 DOID:824 periodontitis 29/610 104/6810 0e+00 1.70e-06 1.0e-06 29
DOID:2320 DOID:2320 obstructive lung disease 54/610 282/6810 0e+00 3.90e-06 2.3e-06 54
DOID:3905 DOID:3905 lung carcinoma 77/610 467/6810 0e+00 4.10e-06 2.4e-06 77
DOID:2723 DOID:2723 dermatitis 37/610 165/6810 1e-07 7.70e-06 4.5e-06 37
DOID:5082 DOID:5082 liver cirrhosis 42/610 201/6810 1e-07 8.00e-06 4.7e-06 42
DOID:37 DOID:37 skin disease 57/610 316/6810 1e-07 8.90e-06 5.2e-06 57
DOID:16 DOID:16 integumentary system disease 62/610 356/6810 2e-07 8.90e-06 5.2e-06 62
DOID:104 DOID:104 bacterial infectious disease 47/610 243/6810 2e-07 1.28e-05 7.5e-06 47
DOID:3310 DOID:3310 atopic dermatitis 31/610 130/6810 3e-07 1.34e-05 7.9e-06 31

Q11: Does resulting significant disease list make sense in relation to Coeliac disease? Are there any diseases which could show similar expression profile as Coeliac disease? If not, speculate, what could be the reason of ambigous results?

A: Not really. It is not easy to interpret e.g. lung diseases in relation with Coeliac disease. Maybe “asthma” and “atopic dermatitis” might be in line as those are also immune-related diseases. However, you may see that most significant results are indicating quite general diseases (e.g. “lung disease”, “skin disease”, etc.) with >100 genes linked to the disease. This is due to hierchical nature of disease ontology, where there are also very broad “diseases” combining many potentially looseli related pathologies. This gives also quite broad enrichment results, as there are actually few “narrow-defined” diseases with so large number of known disease-related genes. It might be more informative to enforce upper limit for the size of tested gene sets- you can experiment with maxGSSize. Hint: Setting maxGSSize e.g. to 100, gives much more easily interpretable results, with immune-related diseases in top. However, you should always reason the choice of analysis settings based on analytical consideration (ideally before running the analysis), not based on which setting gives you the most appealing results!

4 Gene set enrichment analysis (GSEA)

Next we will use GSEA [@Subramanian2005] on the diffrential analysis results. Unlike ORA, GSEA uses all the results from differential expression analysis, not only significant ones.

4.1 Run GSEA (Gene Set Enrichment Analysis) for KEGG pathways

For the sake of speed and brevity, we use recommended minimal number of 1000 permutations for this analysis (to get more stable and precise results your could increase the number of permutations to e.g. 10000 in real scientific work). Also, we test only KEGG pathways which have more than 50 known gene members.

ID Description setSize enrichmentScore NES pvalue p.adjust qvalues rank
hsa04060 hsa04060 Cytokine-cytokine receptor interaction 269 0.5017458 1.970281 0.0012563 0.0279133 0.0218228 1276
hsa04062 hsa04062 Chemokine signaling pathway 178 0.3968676 1.498672 0.0013316 0.0279133 0.0218228 1370
hsa04630 hsa04630 JAK-STAT signaling pathway 154 0.5442419 2.011965 0.0013812 0.0279133 0.0218228 1408
hsa04110 hsa04110 Cell cycle 124 0.6088459 2.190637 0.0013966 0.0279133 0.0218228 2650
hsa04114 hsa04114 Oocyte meiosis 114 0.4869681 1.733099 0.0014124 0.0279133 0.0218228 2405
hsa04620 hsa04620 Toll-like receptor signaling pathway 99 0.4771873 1.675047 0.0014124 0.0279133 0.0218228 2062
hsa04061 hsa04061 Viral protein interaction with cytokine and cytokine receptor 89 0.5492421 1.893031 0.0014493 0.0279133 0.0218228 1260
hsa04914 hsa04914 Progesterone-mediated oocyte maturation 89 0.4811576 1.658370 0.0014493 0.0279133 0.0218228 2268
hsa04657 hsa04657 IL-17 signaling pathway 86 0.6202825 2.124766 0.0014706 0.0279133 0.0218228 1501
hsa05132 hsa05132 Salmonella infection 74 0.6001129 2.021678 0.0015221 0.0279133 0.0218228 970
hsa05321 hsa05321 Inflammatory bowel disease (IBD) 63 0.5715412 1.871064 0.0015480 0.0279133 0.0218228 1006
hsa01212 hsa01212 Fatty acid metabolism 51 0.5957736 1.856027 0.0016260 0.0279133 0.0218228 4217
hsa05162 hsa05162 Measles 135 0.4156609 1.505729 0.0027739 0.0345734 0.0270298 3205
hsa01200 hsa01200 Carbon metabolism 109 0.4589161 1.630560 0.0027855 0.0345734 0.0270298 4440
hsa04742 hsa04742 Taste transduction 66 -0.4426960 -1.593388 0.0028490 0.0345734 0.0270298 3200

Q10: How many pathways are significant if we consider Benjamini-Hochberg FDR?

A: 11

Q11: Are the results in line with ORA results?

A: Generally yes.

4.1.1 Visualize GSEA plot for most significant KEGG pathway

## [1] "Cytokine-cytokine receptor interaction"

5 Compare the biological themes of different stimulations

It is possible to easily compare the results several ORA analyses, using the functionality from compareCluster results. Next we read in and preprocess differentially expressed genes from all three conditions (UN-CeD vs control, unstimulated; CD3-CeD vs control, anti-CD3 stimulation; PMA-CeD vs control, PMA stimulation). We will run the KEGG ORA, using default settings and all genes tested in the study as an background.

# Read in and preprocess all different conditions:

# Unstimulated
unstim <- fread('/Users/NPK/UMCG/git_projects/Teaching-modules/Practicals/differential-expression-and-pathway-analysis/Results/diffExpGenes_unstim_all.csv')

unstim <- unstim[!unstim$baseMean == 0, ]

entrez <- bitr(unstim$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
unstim <- merge(unstim, entrez, by.x = 'V1', by.y = 'SYMBOL')
unstim_overrep <- unstim[unstim$padj < 0.05 & abs(unstim$log2FoldChange) > 1, ]

# CD3
cd3 <- fread('/Users/NPK/UMCG/git_projects/Teaching-modules/Practicals/differential-expression-and-pathway-analysis/Results/diffExpGenes_cd3_all.csv')

cd3 <- cd3[!cd3$baseMean == 0, ]

entrez <- bitr(cd3$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
cd3 <- merge(cd3, entrez, by.x = 'V1', by.y = 'SYMBOL')
cd3_overrep <- cd3[cd3$padj < 0.05 & abs(cd3$log2FoldChange) > 1, ]

# PMA
pma <- fread('/Users/NPK/UMCG/git_projects/Teaching-modules/Practicals/differential-expression-and-pathway-analysis/Results/diffExpGenes_pma_all.csv')

pma <- pma[!pma$baseMean == 0, ]

entrez <- bitr(pma$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
pma <- merge(pma, entrez, by.x = 'V1', by.y = 'SYMBOL')
pma_overrep <- pma[pma$padj < 0.05 & abs(pma$log2FoldChange) > 1, ]

input_comparison <- list(UNST = unstim_overrep$ENTREZID,
                         CD3 = cd3_overrep$ENTREZID,
                         PMA = pma_overrep$ENTREZID)

ck <- compareCluster(geneCluster = input_comparison, fun = "enrichKEGG",
                     universe = unstim$ENTREZID, 
                     pvalueCutoff = 1,
                     qvalueCutoff = 1)

We use dplyr and ggplot2 functionality to compare the 20 most significant pathways for each group. Here, the size of the dot indicates significance (\(-log_{10}(P)\)) and red dots indicate pathways withstanding multiple testing correction.

Q11: Are there any KEGG pathways overlapping between any of the conditions?

A: Yes, “Cytokine-cytokine receptor interaction”, “Inflammatory bowel disease”, between two stimulations.

Q12: What condition has the largest number of significantly enriched pathways?

A: Anti-CD3 stimulation.

Q13: Are the per-cohort results in line with the ORA results reported in the paper?

A: Generally yes, but there are also some differences.

6 Report the session information

Finally, when wrapping up your analysis, it is always a good idea to save the settings you have used. This involves software versions and the date of the analysis, making it possible to replicate or debug the results by different person.

Command sink is useful for saving any output from the screen into file.

Investigate the resulting file and note what information is saved.

7 Bonus Exercise

Next advanced tasks are meant to quickest students who have already finalized previous analyses.

ClusterProfiler has a very useful functionality to allow performing ORA and GSEA for any gene set database. This means that you can define the gene set yourself or use some custom gene set database which is not explicitly implemented to ClusterProfiler.

We will try to use this functionality on anti-CD3 treated Coeliac disease differential expression results.

Task 1 Investigate ClusterProfiler manual here to identify which commands allow you to perform universal ORA and GSEA. Note what extra file(s) are needed to do so.

Task 2 There are numerous gene sets available and downloadable on the web tool enrichr website (http://amp.pharm.mssm.edu/Enrichr/). Locate the downloadable databases from the web site, investigate what kind of gene sets are available and select one you would like to test in your data. Also, think what gene set dataset gives interesting information about differentially expressed genes in Coeliac disase. Download the desirable file.

NB! Some of the data files might not work- then just make another choice for now. Also, some of the datasets include numerical values after each gene name- skip those for now.

Task 3 Unfortunately, the data files are not in the “R-friendly” format. The biggest challenge in this bonus task is to read the file in and convert it to the usable data.frame. It should finally look like that:

term gene
term1 entrezID1
term2 entrezID2

Note that column names should be “term” and “gene”, and gene name should be ENTREZ ID.

  • Google and investigate read.table documentation, how to read tables with variable numbers of elements in each row into R.

    • Hint!: arguments like fill, na.strings, col.names are needed. Also the usage of command count.fields is mandatory.

    • If you manage to read the table into R, apply melt command from reshape2 package to convert data to long format, as you have previously learned in the R part. Install reshape2, if necessary.

    • Remove unnecessary column(s) and rows where gene name is “NA”. Make use of !is.na() for that.

    • Convert HGNC symbols to ENTREZ IDs, as we have done before. Do not forget to convert ENTREZ ID to character.

    • Merge the ENTREZ IDs to the intial table, manipulate the table to the format we need. Use the tricks you have learned today and on previous days for that.

Task 4 Run universal ORA, using differentially expressed genes from CD3 stimulation and your own gene set database. Put P-value and Q-value cutoffs to 1, to investigate top results even if none are significant.

Q: Did you find any significant and/or interpretable results?

Task 5 Run universal GSEA, using all ranked genes from CD3 stimulation and your own gene set database. Put P-value cutoff to 1, to investigate top results even if none are significant.

Q: Did you find any significant and/or interpretable results?

Task 6 Visualize your ORA and GSEA results using plots we have constructed today.

Convert the data into usable format.

Run ORA

Run GSEA


8 References